ABSTRACT

Information-Geometric Optimization (IGO) is a unified framework
of stochastic algorithms for optimization problems.
Given a family of probability distributions, IGO turns the 
original optimization problem into a new maximization prob- 
lem on the parameter space of the probability distributions. 
IGO updates the parameter of the probability distribution 
along the natural gradient, taken with respect to the Fisher 
metric on the parameter manifold, aiming at maximizing an 
adaptive transform of the objective function. IGO recovers 
several known algorithms as particular instances: for the
family of Bernoulli distributions IGO recovers PBIL, for
the family of Gaussian distributions the pure rank-μ CMA-ES
update is recovered, and for exponential families in
expectation parametrization the cross-entropy/ML method is
recovered.

This article provides a theoretical justification for the IGO 
framework, by proving that any step size not greater than 1 
guarantees monotone improvement over the course of optimization,
in terms of q-quantile values of the objective function $f$.
The range of admissible step sizes is independent
of $f$ and its domain. We extend the result to cover the
case of different step sizes for blocks of the parameters in 
the IGO algorithm. Moreover, we prove that expected fit- 
ness improves over time when fitness-proportional selection 
is applied, in which case the RPP algorithm is recovered. 

Categories and Subject Descriptors 

G.1.6 [Mathematics of Computing]: Numerical Analy- 
sis — Optimization 

General Terms 

Theory 
1. INTRODUCTION

Information-Geometric Optimization (IGO) [5] is a unified
framework of model-based stochastic search algorithms for
any optimization problem. As typified by Estimation of
Distribution Algorithms (EDA) [15], model-based randomized
search algorithms build a statistical model $P_\theta$ on the search
space $X$ to generate search points. The parameters $\theta$ of the
statistical model are updated over time so that the probability
distribution hopefully concentrates around the minimum
of the objective function. In most model-based algorithms,
such as EDAs and Ant Colony Optimization (ACO) algorithms [10],
parameter calibration is based on the maximum
likelihood principle or on other intuitive rules. IGO, by contrast,
performs a natural gradient ascent of $\theta$ in the parameter
space $\Theta$, having first adaptively transformed the objective
function into a function on $\Theta$. This construction offers
maximal robustness guarantees with respect to changes in
the representation of the problem (change of parametrization
of the search space, of the parameter space, and of the
fitness values).

Importantly, the IGO framework recovers several known
algorithms [5, Section 4]. When IGO is instantiated using the
family of Bernoulli distributions on $\{0, 1\}^d$, one obtains the
population based incremental learning (PBIL) algorithm [6].
When using the family of Gaussian distributions on $\mathbb{R}^d$, IGO
instantiates as a variant of covariance matrix adaptation
evolution strategies (CMA-ES), the so-called pure rank-μ CMA-ES
update [11]. Moreover, when using an exponential family
with the expectation parameters, the IGO instance is
equivalent to the cross-entropy method for optimization [7]. Of
course, the IGO framework not only provides
information-theoretic derivations for existing algorithms but also
automatically offers new algorithms for possibly complicated
optimization problems. For instance, the IGO update rule for
the parameters of restricted Boltzmann machines has been
derived [5].

Theoretical justification of the IGO framework, therefore, is 
important both to provide a theoretical basis for the recov- 
ered algorithms and to make the design principle for future 
algorithms more reliable. Here we focus on providing a mea- 
sure of "progress" over the course of IGO optimization, in 
terms of quantile values of the objective function. 

Parameter updates by gradient ascent are somewhat justified
in general, at least for infinitesimally small steps,
because the gradient points in the direction of steepest ascent
of a function. However, this argument does not apply to the
IGO algorithm: as the objective function is adaptively
transformed in a time-dependent way, the function on which the
gradient is computed changes over time, so that its increase
does not necessarily mean global improvement. Still, the
IGO framework comes with a guarantee that an infinitesimally
small IGO step along the natural gradient leads to
monotone improvement of a specified quantity, for any
objective function $f$ [5, Proposition 5]: a result from [5] is that
the q-quantile value of the objective function monotonically
improves along the natural gradient. This result is limited
to the exact IGO flow, i.e., an infinite number of sample
points is considered and the step size of the gradient ascent
is infinitesimal. Still, this ensures that the randomized
algorithm with large sample size stays close to the deterministic
trajectory with infinite samples with high probability,
provided the step size is sufficiently small. Now the question
arises whether actual, non-infinitesimal step sizes still ensure
monotone q-quantile improvement.

In this article, we prove that any step size not greater than
1 guarantees monotone q-quantile value improvement in the
IGO algorithm for an exponential family with a finite step
size (Theorem 6), thus extending the previous result from
infinitesimal steps with continuous time to more realistic
algorithmic situations. For instance, this ensures monotone
q-quantile improvement in PBIL (using uniform weights, see
below), or in the cross-entropy method for exponential
families in expectation parameters. Interestingly, our results
show that the admissible step sizes in IGO are independent
of the objective function $f$, at least for large population sizes
(this stems from the many invariance properties built into
IGO).

We further extend the result by defining blockwise updates
in IGO where different blocks of parameters are adjusted
one after another with different step sizes. Our motivation
is that in practice the pure rank-μ CMA-ES update uses
different learning rates for the mean vector and the
covariance matrix. We show that the blockwise update rule
recovers the pure rank-μ CMA-ES update using different
learning rates for the mean vector and the covariance matrix
(Proposition 9). We prove that any distinct step sizes less
than 1 guarantee monotone q-quantile improvement, which
justifies the parameter setting used for the CMA-ES in practice
(Theorem 10).

Other examples fitting into this framework are the Relative
Payoff Procedure (RPP, also known as expectation-maximization
for reinforcement learning) [9, 12], or situations where
fitness-proportional selection is applied using exponential families
(Theorem 12). The RPP is considered as an alternative to
gradient-based methods that allows the use of relatively large
learning rates. As it turns out, the RPP can be described
as a natural gradient based algorithm with step size 1, and
our result is an extension of the proof of its monotone
improvement to generic natural gradient algorithms.

The article is organized as follows. In Section 2 we
explain the IGO framework and its implementation in practice.
IGO-maximum likelihood (IGO-ML), a variant of IGO based on
maximum likelihood, is presented, followed by the relation
between the IGO algorithm, IGO-ML and the cross-entropy
method for optimization, for exponential families of
distributions. In Section 3 we prove monotone q-quantile
improvement in IGO-ML. The result is extended by defining
blockwise IGO-ML, and q-quantile improvement in blockwise
IGO-ML is proved. We also provide a result with finite
but large population sizes. Section 4 is devoted to the
natural gradient algorithm with a fitness-proportional selection
scheme, where monotone improvement of expected fitness is
proven. A short discussion in Section 5 closes the article.

2. INFORMATION-GEOMETRIC 
OPTIMIZATION 

In this article, we consider an objective function $f : X \to \mathbb{R}$
to be minimized over any search space $X$. The search space
$X$ may be continuous or discrete, finite or infinite.

Let $\{P_\theta\}$ be a family of probability distributions parametrized
by $\theta \in \Theta$, and let $p_\theta$ be the probability density function
induced by $P_\theta$ w.r.t. an arbitrary reference measure $dx$ on $X$,
namely, $P_\theta(dx) = p_\theta(x)\,dx$. Given a family of probability
distributions, IGO [5] evolves the probability distribution
$P_{\theta^t}$ at each time $t$ so that higher probabilities are assigned
to better regions. To do so, IGO transforms the objective
function $f(x)$ into a new one $W^f_{\theta^t}(x)$, defines a function on $\Theta$
to be maximized, $J(\theta \mid \theta^t) := E_{P_\theta}[W^f_{\theta^t}(x)]$, and performs
steepest gradient ascent of $J(\theta \mid \theta^t)$ on $\Theta$. Hopefully, after
some time the distribution $P_{\theta^t}$ concentrates around minima
of the objective function.

IGO is designed to exhibit as many invariance properties as
possible [5, Section 2]. The first property is invariance under
strictly increasing transformations of $f$. For any strictly
increasing $g$, IGO minimizes $g \circ f$ as easily as $f$. This property
is realized by a quantile based mapping of $f$ to $W^f_{\theta^t}$ at each
time. The second property is invariance under a change of
coordinates in $X$, provided that this coordinate change globally
preserves the family of probability distributions $\{P_\theta\}$.
For example, the IGO algorithm for Gaussian distributions
on $\mathbb{R}^d$ is invariant under any affine transformation of the
coordinates, whereas the IGO algorithm for isotropic Gaussian
distributions is only invariant under translations and
rotations. Invariance under $X$-coordinate transformation is one
of the key properties for the success of the CMA-ES. The
last property is invariance under reparametrization of $\theta$. At
least for infinitesimal steps of the gradient ascent, IGO
follows the same trajectory on the parameter space whatever
the parametrization for $\theta$ is. This property is obtained by
considering the intrinsic (Fisher) metric on the parameter
space $\Theta$ and defining the steepest ascent on $\Theta$ w.r.t. this
metric, i.e., by using a natural gradient.

The study of the intrinsic metric on the parameter space
of the probability distribution, called a statistical manifold,
is the main topic of information geometry [4]. The most
widely used divergence between two points on the space of
probability distributions is the Kullback-Leibler divergence
(KL divergence)

$$ D_{\mathrm{KL}}(P_\theta \,\|\, P_{\theta'}) := \int \ln \frac{p_\theta(x)}{p_{\theta'}(x)} \, P_\theta(dx) . $$

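The KL divergence is, by definition, independent of the
parametrization $\theta$. Let $\theta' = \theta + \delta\theta$. Then, the KL divergence
between $P_\theta$ and $P_{\theta + \delta\theta}$ expands [13] as

$$ D_{\mathrm{KL}}(P_\theta \,\|\, P_{\theta + \delta\theta}) = \frac{1}{2}\, \delta\theta^{\mathsf T} \mathcal{I}_\theta\, \delta\theta + O(\|\delta\theta\|^3) \tag{1} $$

where $\|\cdot\|$ is the Euclidean norm and $\mathcal{I}_\theta$ is the Fisher
information matrix at $\theta$, defined as

$$ (\mathcal{I}_\theta)_{ij} = \int \frac{\partial \ln p_\theta(x)}{\partial \theta_i} \frac{\partial \ln p_\theta(x)}{\partial \theta_j} \, P_\theta(dx) = - \int \frac{\partial^2 \ln p_\theta(x)}{\partial \theta_i \, \partial \theta_j} \, P_\theta(dx) . $$

The expansion (1) follows from the well-known fact that the
Fisher information matrix is the Hessian of KL divergence.
By using the KL divergence, we have the following property
of the steepest ascent direction (see [3], Theorem 1, or [5],
Proposition 1).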



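Statement 1. Let $g$ be a smooth function on the parameter
space $\Theta$. Let $\theta \in \Theta$ be a nonsingular point where
$\widetilde{\nabla}_\theta g(\theta) \neq 0$. Then the steepest ascent direction of $g$ is given
by the so-called natural gradient $\widetilde{\nabla}_\theta g(\theta) := \mathcal{I}_\theta^{-1} \nabla_\theta g(\theta)$. More
precisely,

$$ \frac{\widetilde{\nabla}_\theta g(\theta)}{\|\widetilde{\nabla}_\theta g(\theta)\|} = \lim_{\varepsilon \to 0} \frac{1}{\varepsilon} \mathop{\arg\max}_{\substack{\delta\theta \text{ such that} \\ D_{\mathrm{KL}}(P_\theta \,\|\, P_{\theta + \delta\theta}) \le \varepsilon^2 / 2}} g(\theta + \delta\theta) . $$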



Since KL divergence does not depend on parametrization,
the natural gradient is invariant under reparametrization of
$\theta$. Hence, the natural gradient step (the steepest ascent step
w.r.t. the Fisher metric) is invariant at least for an
infinitesimal step size [5, Section 2.4].

2.1 Algorithm Description 

For completeness, we include here a short description of the 
IGO algorithm. We refer to [5] for a more complete presen- 
tation. 

First, IGO transforms the objective function into an adaptive
weighted preference by a quantile based approach. This
results in a rank based algorithm, invariant under increasing
transformations of the objective function. Define the lower
and upper $P_\theta$-$f$-quantiles of $x \in X$ as

$$ q^{<}_\theta(x) := P_\theta[y : f(y) < f(x)] $$

$$ q^{\le}_\theta(x) := P_\theta[y : f(y) \le f(x)] . $$

The lower quantile value $q^{<}_\theta(x)$ is the probability of sampling
strictly better points than $x$ under the current distribution
$P_\theta$, while the upper quantile value $q^{\le}_\theta(x)$ is the probability
of sampling points better than or equivalent to $x$. Given
a weight function (selection scheme) $w : [0, 1] \to \mathbb{R}$ that is
non-increasing, the weighted preference $W^f_\theta(x)$ is defined as

$$ W^f_\theta(x) := \begin{cases} w\bigl(q^{\le}_\theta(x)\bigr) & \text{if } q^{\le}_\theta(x) = q^{<}_\theta(x), \\[4pt] \dfrac{1}{q^{\le}_\theta(x) - q^{<}_\theta(x)} \displaystyle\int_{q^{<}_\theta(x)}^{q^{\le}_\theta(x)} w(u)\, du & \text{otherwise.} \end{cases} \tag{2} $$

This way, the quality of a point is measured by a function
of the $P_\theta$-quantile in which it lies. A typical choice of the
selection scheme $w$ is $w(u) = \mathbb{1}_{[u \le q]}/q$, $0 < q < 1$. We
call it the q-truncation selection scheme. Using q-truncation
amounts, in the final IGO algorithm, to giving the same
positive weight to a fraction $q$ of the best samples in a
population, and weight $0$ to the rest, as is often the case in
practice.

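Next, IGO turns the original objective function $f$ on the
search space $X$ into a function $J(\theta \mid \theta^t)$ on the statistical
manifold $\Theta$ by defining

$$ J(\theta \mid \theta^t) := E_{P_\theta}\bigl[W^f_{\theta^t}(x)\bigr] . \tag{3} $$

Note that $J(\theta \mid \theta^t)$ depends on the current position $\theta^t$.
Then, the gradient of $J(\theta \mid \theta^t)$ is computed as

$$ \begin{aligned} \nabla_\theta J(\theta \mid \theta^t) &= \nabla_\theta E_{P_\theta}\bigl[W^f_{\theta^t}(x)\bigr] \\ &= \nabla_\theta \int W^f_{\theta^t}(x)\, p_\theta(x)\, dx \\ &= \int W^f_{\theta^t}(x)\, p_\theta(x)\, \nabla_\theta \ln p_\theta(x)\, dx \\ &= E_{P_\theta}\bigl[W^f_{\theta^t}(x)\, \nabla_\theta \ln p_\theta(x)\bigr] . \end{aligned} \tag{4} $$

Here we have used the relation $\nabla_\theta p_\theta(x) = p_\theta(x)\, \nabla_\theta \ln p_\theta(x)$.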



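Finally, IGO uses natural gradient ascent on the parameter
space. The natural gradient on the statistical manifold
$(\Theta, \mathcal{I})$ equipped with the Fisher metric $\mathcal{I}$ is given by
the product of the inverse of the Fisher information matrix,
$\mathcal{I}_\theta^{-1}$, and the vanilla gradient. That is, the natural gradient
of $J(\cdot \mid \theta^t)$ at $\theta$ is written as $\widetilde{\nabla}_\theta J(\theta \mid \theta^t) = \mathcal{I}_\theta^{-1} \nabla_\theta J(\theta \mid \theta^t)$.
According to (4), we can rewrite the natural gradient as

$$ \widetilde{\nabla}_\theta J(\theta \mid \theta^t) = E_{P_\theta}\bigl[W^f_{\theta^t}(x)\, \widetilde{\nabla}_\theta \ln p_\theta(x)\bigr] . \tag{5} $$

Introducing a finite step size $\delta t$, IGO finally updates the
parameter as follows:

$$ \theta^{t + \delta t} = \theta^t + \delta t\, \widetilde{\nabla}_\theta J(\theta \mid \theta^t)\big|_{\theta = \theta^t} . \tag{6} $$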



2.2 Implementation and Recovering 
Algorithms 

When implementing IGO in practice, it is necessary to
estimate the expectation in (5). The approximation is done by
the Monte Carlo method using $\lambda$ samples taken from $P_{\theta^t}$.
Let $x_1, \dots, x_\lambda$ be independent samples from $P_{\theta^t}$.

First, we need to approximate $W^f_{\theta^t}(x_i)$ for each $i = 1, \dots, \lambda$.
Define

$$ \mathrm{rk}^{<}(x_i) := \#\{j : f(x_j) < f(x_i)\} $$

$$ \mathrm{rk}^{\le}(x_i) := \#\{j : f(x_j) \le f(x_i)\} , $$

let

$$ w_i := \int_{(i-1)/\lambda}^{i/\lambda} w(q)\, dq, \quad \forall i \in [1, \lambda], $$

and set

$$ \hat{w}_i := \frac{1}{\mathrm{rk}^{\le}(x_i) - \mathrm{rk}^{<}(x_i)} \sum_{j = \mathrm{rk}^{<}(x_i) + 1}^{\mathrm{rk}^{\le}(x_i)} w_j . \tag{7} $$

Then $\lambda \hat{w}_i$ is a consistent estimator of $W^f_{\theta^t}(x_i)$; in other
words, $\lim_{\lambda \to \infty} \lambda \hat{w}_i = W^f_{\theta^t}(x_i)$ with probability one. (See
the proof of Theorem 4 in [5].) If there are no ties in our
sample, i.e. $f(x_i) \neq f(x_j)$ for any $i \neq j$, then
$\mathrm{rk}^{\le}(x_i) = \mathrm{rk}^{<}(x_i) + 1$ and (7) simply reads
$\hat{w}_i = w_{\mathrm{rk}^{<}(x_i) + 1}$, but (7) is a
mathematically neater definition of rank based weights
accounting for possible ties. In practice we just design the $\lambda$
weight values $w_1, \dots, w_\lambda$, instead of the selection scheme $w$.

In the rest of this article, we assume for simplicity that the
selection weights $w_i$ are non-negative and sum to 1. This
is the case, for instance, if the selection scheme $w$ is
q-truncation as above.
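The rank based weights of (7) can be sketched as follows for the q-truncation scheme. This is a minimal illustration under stated assumptions: the helper names `truncation_weights` and `rank_weights` are hypothetical, and the integral defining $w_j$ is evaluated in closed form for $w(u) = \mathbb{1}_{[u \le q]}/q$ only.

```python
import numpy as np

def truncation_weights(lam, q):
    """w_j = integral of 1[u <= q]/q over [(j-1)/lam, j/lam], for j = 1..lam."""
    edges = np.arange(lam + 1) / lam
    clipped = np.minimum(edges, q)        # only the part of each cell below q counts
    return np.diff(clipped) / q

def rank_weights(f_values, w):
    """Tie-aware weights hat{w}_i of equation (7)."""
    f_values = np.asarray(f_values, dtype=float)
    w_hat = np.empty(len(f_values))
    for i, fi in enumerate(f_values):
        rk_lt = int(np.sum(f_values < fi))     # rk^<(x_i)
        rk_le = int(np.sum(f_values <= fi))    # rk^<=(x_i)
        # average of w_j over j = rk_lt + 1, ..., rk_le (1-indexed)
        w_hat[i] = w[rk_lt:rk_le].mean()
    return w_hat

w = truncation_weights(4, 0.5)
print(rank_weights([3.0, 1.0, 2.0, 4.0], w))   # no ties: the best half shares the weight
print(rank_weights([1.0, 1.0, 2.0, 3.0], truncation_weights(4, 0.25)))  # tied best points
```

In the second call the two tied best points split the single positive weight equally, which is the averaging behaviour that (7) is designed to provide.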



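Next, Monte Carlo sampling is applied to the expectation
(5), using $\hat{w}_i$ and $x_i$. Replacing the expectation with a sample
average $\frac{1}{\lambda} \sum_{i=1}^{\lambda}$ and $W^f_{\theta^t}(x_i)$ with $\lambda \hat{w}_i$, we get

$$ G^t := \sum_{i=1}^{\lambda} \hat{w}_i\, \widetilde{\nabla}_\theta \ln p_\theta(x_i)\big|_{\theta = \theta^t} . \tag{8} $$

Again, $G^t$ is a consistent estimator of the IGO step at $\theta^t$,
i.e., of $\widetilde{\nabla}_\theta J(\theta \mid \theta^t)\big|_{\theta = \theta^t}$. See Theorem 4 in [5].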

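Now the practical IGO algorithm implementation can be
written in the form of a black-box search algorithm as

1. Sample $x_i$, $i = 1, \dots, \lambda$, independently from $P_{\theta^t}$;

2. Evaluate $f(x_i)$ and compute $\mathrm{rk}^{\le}(x_i)$ and $\mathrm{rk}^{<}(x_i)$;

3. Evaluate $G^t = \sum_{i=1}^{\lambda} \hat{w}_i\, \widetilde{\nabla}_\theta \ln p_\theta(x_i)\big|_{\theta = \theta^t}$;

4. Update the parameter: $\theta^{t + \delta t} = \theta^t + \delta t \cdot G^t$.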



Finally, to obtain an explicit form of the parameter update
equation, we need to know the explicit form of the natural
gradient of the log-likelihood, which depends on the family of
probability distributions and its parametrization. Explicit
forms of $\widetilde{\nabla}_\theta \ln p_\theta(x)$ are known for some specific families of
probability distributions with specific parametrizations, and
the above algorithm sometimes coincides with known
algorithms.



Example 1. The family of Bernoulli distributions on $X = \{0, 1\}^d$
is defined as $P_\theta(x) = \prod_{j=1}^{d} \theta_j^{x_j} (1 - \theta_j)^{1 - x_j}$. The
natural gradient of the log-likelihood is readily computed as
$\widetilde{\nabla}_\theta \ln p_\theta(x) = x - \theta$ (Section 4.1 in [5]). The natural gradient
update reads

$$ \theta^{t + \delta t} = \theta^t + \delta t \sum_{i=1}^{\lambda} \hat{w}_i\, (x_i - \theta^t) . $$

This is equivalent to so-called PBIL (population based
incremental learning, [6]). See Section 4.1 in [5] for details.
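The four-step black-box loop, specialized to the Bernoulli update of Example 1, might look as follows. This is only a sketch under stated assumptions: the function name `igo_bernoulli`, the default values of `lam`, `q`, `dt`, `steps`, and the test objective are all hypothetical, and ties in $f$ are broken by sample order rather than averaged as in (7).

```python
import numpy as np

def igo_bernoulli(f, d, lam=40, q=0.25, dt=0.2, steps=200, seed=0):
    """Finite-sample IGO step for Bernoulli distributions (PBIL-like update)."""
    rng = np.random.default_rng(seed)
    theta = np.full(d, 0.5)
    mu = max(1, int(q * lam))                # number of selected samples
    w_hat = np.zeros(lam)
    w_hat[:mu] = 1.0 / mu                    # q-truncation weights by rank, sum to 1
    for _ in range(steps):
        # Step 1: sample lam points from P_theta
        x = (rng.random((lam, d)) < theta).astype(float)
        # Step 2: rank the samples by objective value (smaller is better)
        order = np.argsort(np.array([f(xi) for xi in x]), kind="stable")
        # Step 3: G^t = sum_i w_hat_i (x_i - theta), natural gradient x - theta
        g = w_hat @ (x[order] - theta)
        # Step 4: parameter update with step size dt
        theta = theta + dt * g
    return theta

# Hypothetical test objective: number of zero bits, to be minimized.
theta_final = igo_bernoulli(lambda x: x.size - x.sum(), d=10)
print(theta_final)
```

Because the weights sum to 1 and `dt` does not exceed 1, each update is a convex combination of the current `theta` and the weighted mean of the selected samples, so `theta` stays in $[0, 1]^d$.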



Example 2. The probability density function of a multivariate
Gaussian distribution on $X = \mathbb{R}^d$ with mean vector
$m$ and covariance matrix $C$ is defined as

$$ p_\theta(x) = \bigl(\det(2\pi C)\bigr)^{-1/2} \exp\bigl(-(x - m)^{\mathsf T} C^{-1} (x - m)/2\bigr) . $$

When $\theta = (m, C)$, the explicit form of $\widetilde{\nabla}_\theta \ln p_\theta(x)$ is known
to be

$$ \widetilde{\nabla}_\theta \ln p_\theta(x) = \begin{bmatrix} x - m \\ (x - m)(x - m)^{\mathsf T} - C \end{bmatrix} $$

(see [2]). Then, the natural gradient update reads

$$ \theta^{t + \delta t} = \theta^t + \delta t \sum_{i=1}^{\lambda} \hat{w}_i \begin{bmatrix} x_i - m^t \\ (x_i - m^t)(x_i - m^t)^{\mathsf T} - C^t \end{bmatrix} . $$

This is equivalent to the pure rank-μ CMA-ES update [11]

$$ \begin{aligned} m^{t+1} &= m^t + \eta_m \sum_{i=1}^{\lambda} \hat{w}_i\, (x_i - m^t) \\ C^{t+1} &= C^t + \eta_C \sum_{i=1}^{\lambda} \hat{w}_i \bigl((x_i - m^t)(x_i - m^t)^{\mathsf T} - C^t\bigr) \end{aligned} $$

except that $\eta_m = \eta_C = \delta t$ in the natural gradient update.

2.3 Maximum likelihood, IGO-ML, and cross-entropy

In the sequel, we prove monotone improvement of the objective
function for a variant of IGO known as IGO-maximum
likelihood (IGO-ML, introduced in [5, Section 3]). The result
is then transferred to IGO because the two algorithms
exactly coincide in an important class of cases, namely,
exponential families using mean value parametrization.

The IGO-ML algorithm [5, Section 3] updates the current
parameter value $\theta^t$ by taking a weighted maximum likelihood
of the current distribution and the best sampled points.
Assume as above that $\sum_i \hat{w}_i = 1$. Then the IGO-ML update
is defined as

$$ \theta^{t + \delta t} = \arg\max_\theta \Bigl\{ (1 - \delta t)\, E_{P_{\theta^t}}[\ln p_\theta(x)] + \delta t \sum_{i=1}^{\lambda} \hat{w}_i \ln p_\theta(x_i) \Bigr\} \tag{9} $$


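where we note that the first part is the cross-entropy of
$P_{\theta^t}$ and $P_\theta$, and thus, taken alone, is maximized for $\theta = \theta^t$.
Taking the limit $\lambda \to \infty$, we also define the infinite-population
IGO-ML update as

$$ \theta^{t + \delta t} = \arg\max_\theta \bigl\{ (1 - \delta t)\, E_{P_{\theta^t}}[\ln p_\theta(x)] + \delta t\, H_t(\theta) \bigr\} \tag{10} $$

where we set

$$ H_t(\theta) := E_{P_{\theta^t}}\bigl[W^f_{\theta^t}(x) \ln p_\theta(x)\bigr] , $$

a "weighted cross-entropy" of $\theta$ and $\theta^t$.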

Note that the finite- and infinite-population IGO-ML
updates only make sense when there is a unique maximizer
$\theta$ in (9) and (10), respectively. This assumption is always
satisfied, for instance, for exponential families of probability
distributions, as considered below (Statement 2).

The IGO-ML update is compatible with the IGO update,
in the sense that for $\delta t \to 0$ the direction and magnitude of
these updates coincide [5, Section 3].

The IGO-ML method is also related to the cross-entropy
(CE) or maximum-likelihood (ML) method for optimization
[7], which can be written as

$$ \theta^{t+1} = \arg\max_\theta \sum_{i=1}^{\lambda} \hat{w}_i \ln p_\theta(x_i) $$

and its smoothed version, which reads [7]

$$ \theta^{t + \delta t} = (1 - \delta t)\, \theta^t + \delta t \arg\max_\theta \sum_{i=1}^{\lambda} \hat{w}_i \ln p_\theta(x_i) . \tag{11} $$

Note that IGO-ML is parametrization-independent whereas
for $\delta t \neq 1$ the smoothed CE/ML method is not.
Consequently, in general these updates will differ.

2.4 IGO and IGO-ML for Exponential Families

An exponential family is a set $\{p_\theta ; \theta \in \Theta\}$ of probability
density functions $p_\theta$ with respect to an arbitrary measure
$dx$ on $X$ defined as

$$ p_\theta(x) = \frac{1}{Z(\theta)} \exp\Bigl( \sum_{i=1}^{n} \beta_i\, T_i(x) \Bigr) \tag{12} $$

where $\beta = (\beta_i)_{1 \le i \le n}$ is the so-called natural (i.e. canonical)
parameter, each $T_i$, $1 \le i \le n$, is a map $T_i : X \to \mathbb{R}$ such
that $\{T_1, \dots, T_n, x \mapsto 1\}$ are linearly independent, and $Z(\theta)$ is
the normalization factor. This linear independence ensures
that the manifold of the exponential family is nonsingular.
Many probability models, including multivariate Gaussian
distributions, are expressed as exponential families. See [4,
Section 2.3] for examples.

If we define

$$ \eta(\theta) := E_{P_\theta}[T(x)] = \int T(x)\, p_\theta(x)\, dx , \tag{13} $$

$\eta = (\eta_i)_{1 \le i \le n}$ is the so-called expectation parameter. For
example, the expectation parameter for the multivariate Gaussian
distribution encodes the first moment $E_{P_\theta}[x]$ and the
second moment $E_{P_\theta}[x x^{\mathsf T}]$. Other examples can be found
in [4, Section 3.5].

We will repeatedly and implicitly make use of the following 
well-known fact for exponential families. 



Statement 2. Let $x_1, \dots, x_k$ be $k$ points in $X$ and let
$\alpha_1, \dots, \alpha_k$ be non-negative numbers with $\sum_i \alpha_i = 1$. Then
the value $\theta$ of the parameter such that the associated
expectation parameter satisfies $\eta(\theta) = \sum_i \alpha_i T(x_i)$, if it
belongs to the statistical manifold, is the unique maximizer of
the weighted log-likelihood: $\theta = \arg\max_\theta \sum_i \alpha_i \ln p_\theta(x_i)$. An
analogous statement holds if the finite sum is replaced with
an integral or a combination of both.



(Uniqueness boils down to strict concavity of $\ln p_\theta(x)$ as a
function of $\theta$. The restriction placed on $\eta$ to belong to the
statistical manifold is necessary: for instance, for Gaussian
distributions, if the number of points $k$ is not greater than
the dimension of the ambient space, a degenerate distribution
$\theta$ will result.)
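For Gaussian distributions, Statement 2 amounts to moment matching with $T(x) = (x, x x^{\mathsf T})$. The following sketch illustrates this numerically; the function names, the random points, and the weights `alpha` are illustrative assumptions, and the number of positively weighted points is chosen larger than the dimension so that the result is nondegenerate.

```python
import numpy as np

def weighted_gaussian_ml(points, alpha):
    """Moment matching: eta = sum_i alpha_i T(x_i) with T(x) = (x, x x^T)."""
    points = np.asarray(points, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    first = alpha @ points                                     # sum_i alpha_i x_i
    second = np.einsum("i,ij,ik->jk", alpha, points, points)   # sum_i alpha_i x_i x_i^T
    mean = first
    cov = second - np.outer(mean, mean)                        # C = second moment - m m^T
    return mean, cov

def weighted_loglik(points, alpha, mean, cov):
    """Weighted Gaussian log-likelihood sum_i alpha_i ln p(x_i)."""
    diff = points - mean
    _, logdet = np.linalg.slogdet(2 * np.pi * cov)
    quad = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    return float(alpha @ (-0.5 * logdet - 0.5 * quad))

rng = np.random.default_rng(0)
pts = rng.normal(size=(6, 2))                                  # hypothetical sample points
alpha = np.array([0.3, 0.25, 0.2, 0.15, 0.06, 0.04])           # hypothetical weights
m, C = weighted_gaussian_ml(pts, alpha)
print(weighted_loglik(pts, alpha, m, C))                       # value at the maximizer
print(weighted_loglik(pts, alpha, m + 0.1, C))                 # a shifted mean does worse
```

Perturbing either the mean or the covariance away from the moment-matched values lowers the weighted log-likelihood, consistent with the uniqueness claim.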

The following statement from [5] shows that the natural
gradient of a function in the expectation parametrization
is given by the vanilla gradient of the function w.r.t. the
natural parameter, and vice versa.



Statement 3 (Proposition 22 in [5]). Let $g$ be a function
on the statistical manifold of an exponential family as
above. Then the components of the natural gradient w.r.t.
the expectation parameters are given by the vanilla gradient
w.r.t. the natural parameters and vice versa, that is,

$$ \widetilde{\nabla}_{\eta_i}\, g = \frac{\partial g}{\partial \beta_i} \quad \text{and} \quad \widetilde{\nabla}_{\beta_i}\, g = \frac{\partial g}{\partial \eta_i} . $$

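According to Statement 3, each component of the natural
gradient of the log-likelihood $\ln p_\theta(x)$ under the expectation
parametrization $\theta = \eta$ is equal to the corresponding component of
the vanilla gradient w.r.t. the natural parameters, i.e.,

$$ \widetilde{\nabla}_{\eta_i} \ln p_\theta(x) = \frac{\partial \ln p_\theta(x)}{\partial \beta_i} = T_i(x) - \eta_i , \tag{14} $$

where the latter equality is well-known, e.g., [4, (2.33)]. The
IGO update (6) under the expectation parametrization thus
reads

$$ \eta^{t + \delta t} = \eta^t + \delta t\, E_{P_{\theta^t}}\bigl[W^f_{\theta^t}(x)\, (T(x) - \eta^t)\bigr] \tag{15} $$

and the natural gradient update with finite sample size reads

$$ \eta^{t + \delta t} = \eta^t + \delta t \sum_{i=1}^{\lambda} \hat{w}_i\, (T(x_i) - \eta^t) . \tag{16} $$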



Suppose as above that the selection weights sum to one:
$E_{P_{\theta^t}}[W^f_{\theta^t}(x)] = \int_0^1 w(q)\, dq = 1$ and thus $\sum_{i=1}^{\lambda} \hat{w}_i = 1$. Then,
IGO has a close relation with the CE/ML method for optimization.
As is stated in Theorem 15 in [5], for an exponential family
the CE/ML method (11) and the IGO instance (16), when
expressed with the expectation parametrization ($\theta = \eta$),
coincide with IGO-ML (9).



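Statement 4 (Theorem 15 in [5]). For optimization
using an exponential family $\{P_\theta\}$, these three algorithms
coincide: IGO-ML; the IGO expressed in expectation
parameters; the CE/ML expressed in expectation parameters. That
is, for an exponential family with the expectation
parametrization, for $0 < \delta t \le 1$ we have (writing in turn IGO, CE/ML
and IGO-ML)

$$ \begin{aligned} \theta^{t + \delta t} &= \theta^t + \delta t \sum_{i=1}^{\lambda} \hat{w}_i \bigl(T(x_i) - \theta^t\bigr) \\ &= (1 - \delta t)\, \theta^t + \delta t \arg\max_\theta \sum_{i=1}^{\lambda} \hat{w}_i \ln p_\theta(x_i) \\ &= \arg\max_\theta \Bigl\{ (1 - \delta t)\, E_{P_{\theta^t}}[\ln p_\theta(x)] + \delta t \sum_{i=1}^{\lambda} \hat{w}_i \ln p_\theta(x_i) \Bigr\} . \end{aligned} \tag{17} $$

In the limit of infinite sample size $\lambda \to \infty$ this rewrites as

$$ \begin{aligned} \theta^{t + \delta t} &= \theta^t + \delta t\, E_{P_{\theta^t}}\bigl[W^f_{\theta^t}(x)\, \bigl(T(x) - \theta^t\bigr)\bigr] \\ &= (1 - \delta t)\, \theta^t + \delta t \arg\max_\theta H_t(\theta) \\ &= \arg\max_\theta \bigl\{ (1 - \delta t)\, E_{P_{\theta^t}}[\ln p_\theta(x)] + \delta t\, H_t(\theta) \bigr\} \end{aligned} \tag{18} $$

where we recall that $H_t(\theta) = E_{P_{\theta^t}}[W^f_{\theta^t}(x) \ln p_\theta(x)]$.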
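The three-way identity (17) can be checked numerically on a one-dimensional Bernoulli family, where $T(x) = x$ and the weighted maximum likelihood estimate is the weighted mean. The samples, weights, and step size below are hypothetical values chosen for illustration, and the IGO-ML maximizer is located by a plain grid search rather than by any method from the text.

```python
import numpy as np

theta_t, dt = 0.4, 0.3                       # illustrative current parameter and step size
x = np.array([1.0, 1.0, 0.0, 1.0])           # hypothetical samples, already ranked
w_hat = np.array([0.5, 0.3, 0.2, 0.0])       # hypothetical weights summing to 1

# IGO step in the expectation parameter, first line of (17)
theta_igo = theta_t + dt * np.sum(w_hat * (x - theta_t))

# Smoothed CE/ML, second line of (17): the weighted ML estimate is the weighted mean
theta_ce = (1 - dt) * theta_t + dt * np.sum(w_hat * x)

# IGO-ML, third line of (17): maximize the objective of (9) on a fine grid
grid = np.linspace(1e-6, 1 - 1e-6, 200_001)
cross_entropy = theta_t * np.log(grid) + (1 - theta_t) * np.log(1 - grid)
weighted_ll = sum(wi * (xi * np.log(grid) + (1 - xi) * np.log(1 - grid))
                  for wi, xi in zip(w_hat, x))
theta_igoml = grid[np.argmax((1 - dt) * cross_entropy + dt * weighted_ll)]

print(theta_igo, theta_ce, theta_igoml)      # all three agree up to grid resolution
```

The grid search is only accurate to the grid spacing, so agreement of the third value is expected to within about $10^{-5}$.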



Remark 1. Malagò et al. [16] study information-geometric
aspects of exponential families for optimization. One
difference from the IGO framework is that the optimization
problem is defined as the minimization of the expectation of the
objective function over $P_\theta$, namely

$$ \min_\theta E_{P_\theta}[f(x)] , $$

which they call the stochastic relaxation of the original
optimization problem. They study this for an exponential
family on a discrete search space with the natural
parametrization ($\theta = \beta$) and propose the natural gradient descent
algorithm. Note that this requires computation of the empirical
Fisher information matrix to perform natural gradient
descent. However, if the algorithm is modified to use the
expectation parameters instead, one can compute the natural
gradient descent directly as

$$ \eta^{t + \delta t} = \eta^t - \delta t \sum_{i=1}^{\lambda} f(x_i)\, (T(x_i) - \eta^t) . \tag{19} $$

We study this algorithm in Section 4.

3. QUANTILE IMPROVEMENT 

One possible way to provide theoretical backing for an
optimization algorithm is to show monotonic improvement at
each step of the algorithm (although this is by no means
necessary: e.g., for stochastic algorithms, this is not expected
to hold at each step). For example, consider the sphere
function $f : x \mapsto \|x\|^2$. Then, it is easy to show that the
gradient steps $x^{t + \delta t} = x^t - \delta t\, \nabla_x f(x^t)$ generate a monotonically
decreasing sequence $\{f(x^t)\}_{t \ge 0}$ provided $0 < \delta t \le 1/2$. For
any smooth function, infinitesimal gradient steps are
guaranteed to improve the objective function values; but in general
the admissible step size strongly depends on the function
and has to be adjusted by the user.
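The sphere example can be made concrete in a few lines. Since $\nabla_x f(x) = 2x$, each step multiplies $x$ by $(1 - 2\,\delta t)$ and hence $f$ by $(1 - 2\,\delta t)^2$. The function name `sphere_descent` and the starting point are illustrative assumptions.

```python
import numpy as np

def sphere_descent(x0, dt, steps=20):
    """Gradient steps x <- x - dt * grad f(x) on f(x) = ||x||^2; returns f along the run."""
    x = np.asarray(x0, dtype=float)
    values = [float(x @ x)]
    for _ in range(steps):
        x = x - dt * 2.0 * x              # the gradient of ||x||^2 is 2x
        values.append(float(x @ x))
    return values

small = sphere_descent([1.0, -2.0], dt=0.25)   # step size in the admissible range
large = sphere_descent([1.0, -2.0], dt=1.5)    # step size far too large
print(small[:4])    # decreasing values
print(large[:4])    # increasing values
```

With `dt=1.5` the contraction factor is $(1 - 3)^2 = 4$, so the objective grows at every step, which illustrates how the admissible step size depends on the function for plain gradient descent.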

When it comes to the counterpart in IGO, however, we follow
the gradient of the function $J(\theta \mid \theta^t)$, which depends
on $\theta^t$, so that step-by-step improvement in the objective,
$J(\theta^{t+1} \mid \theta^t) > J(\theta^t \mid \theta^t)$, does not necessarily mean
improvement. (It might happen that $J(\theta^t \mid \theta^{t+1}) > J(\theta^{t+1} \mid \theta^{t+1})$
and $J(\theta^{t+1} \mid \theta^t) > J(\theta^t \mid \theta^t)$ at the same time.)

A key feature of the IGO framework is its invariance under
changing the objective function $f$ by an increasing
transformation (e.g. optimizing $f^3$ instead of $f$). Thus, any measure
of progress that is not compatible with such transformations
(e.g. the expectation $E_{P_{\theta^t}} f$) is not a good candidate to
always improve over the course of IGO optimization.

As a measure of improvement, Arnold et al. [5] use the
notion of q-quantile of $f$. The q-quantile $Q^q_P(f)$ of $f$ under
a probability distribution $P$ is any number $m$ such that
$P[x : f(x) \le m] \ge q$ and $P[x : f(x) \ge m] \ge 1 - q$. For
instance, $Q^q_P(f)$ is the median value of $f$ under $P$ if $q = 1/2$.
For smooth distributions and continuous $f$ there is only one
such number $m$, but in general the set of such $m$ may be a
closed interval, for instance if $f$ has "jumps". For the sake
of definiteness let us use the largest such value:

$$ Q^q_P(f) := \sup\bigl\{ m \in \mathbb{R} : P[x : f(x) \le m] \ge q \text{ and } P[x : f(x) \ge m] \ge 1 - q \bigr\} . $$

(This is because we want to minimize the objective function
$f$; when IGO is used for maximization instead, Theorem 6
has to be written using an infimum in the definition of $Q^q_P(f)$
instead.)
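For a distribution with finite support, the supremum in this definition is attained at the largest support value $v$ with $P[f \ge v] \ge 1 - q$. A minimal sketch of this special case follows; the function name `q_quantile` and the small rounding tolerance are assumptions made for illustration.

```python
import numpy as np

def q_quantile(values, probs, q):
    """Largest m with P[f <= m] >= q and P[f >= m] >= 1 - q (finite support only)."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(values)
    v, p = values[order], probs[order]
    upper_tail = np.cumsum(p[::-1])[::-1]        # P[f >= v_k] for each sorted value
    admissible = upper_tail >= 1 - q - 1e-12     # small tolerance for rounding
    return float(v[admissible].max())

f_values = [1.0, 2.0, 3.0, 4.0]                  # objective values of four points
uniform = [0.25, 0.25, 0.25, 0.25]
print(q_quantile(f_values, uniform, 0.5))        # the largest median
print(q_quantile(f_values, uniform, 0.25))
```

With four equally likely values, every $m$ in $[2, 3]$ is a median in the sense above, and the convention of taking the largest admissible value selects 3.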

It is proven in [5] that when using the q-truncation selection
scheme, the q-quantile value of $f$ monotonically decreases
along infinitesimal IGO steps.

Statement 5 (Proposition 5 in [5]). Consider the
q-truncation selection scheme $w(u) = \mathbb{1}_{[u \le q]}/q$ where $0 < q < 1$
is fixed. Then each infinitesimal IGO step (6), where $\delta t$
is infinitesimal, leads to monotonic improvement in the
q-quantile of $f$: $Q^q_{P_{\theta^{t + \delta t}}}(f) \le Q^q_{P_{\theta^t}}(f)$.

3.1 Quantile Improvement in IGO-ML 

In practice, explicit algorithms do not use continuous time
with infinitesimal time steps: the time step $\delta t$ may be quite
large and its calibration may be an important issue. It is
more interesting and important to see how long a step we
can take along the natural gradient, i.e. how large a $\delta t$ we
can choose while guaranteeing q-quantile improvement.

When using IGO-ML (and thus when using IGO or CE/ML 
on an exponential family with the expectation parametriza- 
tion), we can obtain such a conclusion; the size of the steps 
may even be chosen independently of the objective function. 

Theorem 6. Let the selection scheme be $w(u) = \mathbb{1}_{[u \le q]}/q$
where $0 < q < 1$. Assume that the arg max defining the
IGO-ML step (10) is uniquely determined. Then for $0 < \delta t \le 1$,
each infinite-population IGO-ML step (10) leads to
q-quantile improvement: $Q^q_{P_{\theta^{t + \delta t}}}(f) \le Q^q_{P_{\theta^t}}(f)$.

Moreover, equality can hold only if $P_{\theta^{t + \delta t}} = P_{\theta^t}$ or if
$P_{\theta^{t + \delta t}}[x : f(x) = Q^q_{P_{\theta^t}}(f)] > 0$.

Corollary 7. For exponential families written in expec- 
tation parameters, on any search space, the same holds for 
the CE/ML method and for the IGO algorithm. 

Note that the first condition for equality means the algo- 
rithm has reached a stable point. 

The second condition for equality typically happens for
discrete search spaces: on such spaces, the q-quantile evolves
in time by discrete jumps even when $\theta^t$ moves smoothly, so
we cannot expect strict quantile improvement at each step.
On the other hand, with continuous distributions on
continuous search spaces, the second equality condition can only
occur if the objective function has a plateau (a level set with
non-zero measure).

Proof. If $P_{\theta^{t + \delta t}} = P_{\theta^t}$, obviously $Q^q_{P_{\theta^{t + \delta t}}}(f) = Q^q_{P_{\theta^t}}(f)$.
Hereunder, we assume $P_{\theta^{t + \delta t}} \neq P_{\theta^t}$.

Consider the function $J(\theta \mid \theta^t)$ defining the expected
$P_{\theta^t}$-adjusted fitness of a random point under $P_\theta$:

$$ J(\theta \mid \theta^t) = E_{P_\theta}\bigl[W^f_{\theta^t}(x)\bigr] $$

and remember that $J(\theta^t \mid \theta^t) = 1$. The idea is as follows:
letting $Y$ be the set of points with $P_{\theta^t}(Y) = q$ at which the
objective function $f$ is smallest (the sublevel set of $f$ with
$P_{\theta^t}$-mass $q$), then with our choice of $w$, $W^f_{\theta^t}(x)$ is (up to
technicalities) equal to $1/q$ on $Y$ and $0$ elsewhere, so that
$J(\theta \mid \theta^t)$ represents $1/q$ times the $P_\theta$-probability of falling
into $Y$ (hence $J(\theta^t \mid \theta^t) = 1$). Thus $J(\theta \mid \theta^t) > 1$ will mean
that the $P_\theta$-probability of falling into $Y$ is larger than $q$, so
that $P_\theta$ improves over $P_{\theta^t}$ and the q-quantile has decreased.

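We are going to prove that the IGO-ML update satisfies
$J(\theta^{t + \delta t} \mid \theta^t) > 1$ if $P_{\theta^t} \neq P_{\theta^{t + \delta t}}$. More precisely we prove
that

$$ J(\theta^{t + \delta t} \mid \theta^t) > \exp\Bigl( \frac{1 - \delta t}{\delta t}\, D_{\mathrm{KL}}(P_{\theta^t} \,\|\, P_{\theta^{t + \delta t}}) \Bigr) . $$

This will imply quantile improvement, thanks to the following
lemma, the proof of which is postponed.

Lemma 8. Let the selection scheme $w$ be as above. If
$J(\theta^{t + \delta t} \mid \theta^t) > 1$, then $Q^q_{P_{\theta^{t + \delta t}}}(f) \le Q^q_{P_{\theta^t}}(f)$. If moreover
$P_{\theta^{t + \delta t}}[x : f(x) = Q^q_{P_{\theta^t}}(f)] = 0$, then $Q^q_{P_{\theta^{t + \delta t}}}(f) < Q^q_{P_{\theta^t}}(f)$.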

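The lower bound on $J(\theta^{t + \delta t} \mid \theta^t)$ is obtained as follows. Since
$\int W^f_{\theta^t}(x)\, p_{\theta^t}(x)\, dx = 1$ and $W^f_{\theta^t}(x)\, p_{\theta^t}(x) \ge 0$ for any $x$,
$W^f_{\theta^t}(x)\, p_{\theta^t}(x)$ can be viewed as a probability density
function. Since $\ln$ is concave, by Jensen's inequality we have

$$ \begin{aligned} \ln J(\theta \mid \theta^t) &= \ln \int \frac{p_\theta(x)}{p_{\theta^t}(x)}\, W^f_{\theta^t}(x)\, p_{\theta^t}(x)\, dx \\ &\ge \int \ln\Bigl( \frac{p_\theta(x)}{p_{\theta^t}(x)} \Bigr)\, W^f_{\theta^t}(x)\, p_{\theta^t}(x)\, dx \\ &= H_t(\theta) - H_t(\theta^t) . \end{aligned} $$

Thus, if $H_t(\theta) > H_t(\theta^t)$ we have $J(\theta \mid \theta^t) > 1$.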

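Now, according to (10), $\theta^{t + \delta t}$ uniquely maximizes the quantity
$(1 - \delta t)\, E_{P_{\theta^t}}[\ln p_\theta(x)] + \delta t\, H_t(\theta)$. Therefore, if $\theta^{t + \delta t} \neq \theta^t$,
we have

$$ (1 - \delta t)\, E_{P_{\theta^t}}[\ln p_{\theta^{t + \delta t}}(x)] + \delta t\, H_t(\theta^{t + \delta t}) > (1 - \delta t)\, E_{P_{\theta^t}}[\ln p_{\theta^t}(x)] + \delta t\, H_t(\theta^t) $$

and rearranging we get

$$ \begin{aligned} H_t(\theta^{t + \delta t}) - H_t(\theta^t) &> \frac{1 - \delta t}{\delta t} \bigl( E_{P_{\theta^t}}[\ln p_{\theta^t}(x)] - E_{P_{\theta^t}}[\ln p_{\theta^{t + \delta t}}(x)] \bigr) \\ &= \frac{1 - \delta t}{\delta t}\, D_{\mathrm{KL}}(P_{\theta^t} \,\|\, P_{\theta^{t + \delta t}}) . \end{aligned} \tag{21} $$

The right-hand side of this inequality is non-negative for
$0 < \delta t \le 1$.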

This will prove the theorem once Lemma 8 is proved, which
we now proceed to do. □

Proof of Lemma 8. Hereunder, we abbreviate $m$ for the
q-quantile value $Q^q_{P_{\theta^t}}(f)$ of $f$ under $P_{\theta^t}$.



Let us compute the weighted preference $W^f_{\theta^t}(x)$. Since the
selection scheme $w$ satisfies $0 \le w(u) \le 1/q$ for all $u \in [0, 1]$,
we have $0 \le W^f_{\theta^t}(x) \le 1/q$ for any $x$.

We claim that $f(x) > m$ implies $W^f_{\theta^t}(x) = 0$. Indeed,
suppose that $x$ is such that $f(x) > m$. Since by definition $m$
is the largest value such that $P_{\theta^t}[y : f(y) \ge m] \ge 1 - q$, we
must have $P_{\theta^t}[y : f(y) \ge f(x)] < 1 - q$. Hence
$P_{\theta^t}[y : f(y) < f(x)] > q$, i.e., $q^{<}_{\theta^t}(x) > q$. Since $w(u) = 0$
for $u > q$, this implies $W^f_{\theta^t}(x) = 0$ for
our choice of selection scheme $w$.

Thus W g f t (x) is at most 1/q and vanishes if f(x) > m. For 
any probability distribution Pg, this implies that 

J(0 I 0*) = E Pe [W g { (x)} < - P e [x : f(x) < m] . 

Therefore, 

J(0 [ 0*) > 1 Pg[x : f{x) <m]>q 

If moreover Pg[x : f(x) — m] — 0, we have Pg[x : f(x) < 
m] = Pe[s : /(a;) < m] hence 

J(0 J ) > 1 P e [s : /(as) < m] > q 

Pg[x : f{x) >m]<l-q 
^Q q Pe (f)<m . 

Altogether, J(9 t+St | 0*) > 1 implies quantile improvement 
Qp et+St (f) < Qp et (f)- Moreover, if P gt+St [x : f(x) = 
rn] = 0, we have strict quantile improvement Q q p (/) < 

<&„(/)■ □ 

This completes the proof of Theorem 6.

Example 3. Bernoulli distributions constitute an exponential
family where the sufficient statistics $T_i(x)$ are $x_i$. The
parameter used in PBIL (Example 1) is indeed the expectation
parameter. Thus, PBIL is an instance of IGO-ML
and can be viewed as a CE/ML method at the same time.
Hence, by Theorem 6, each infinite-population PBIL step
leads to $q$-quantile improvement if we employ $q$-truncation
selection, which is not the same as the exponential weights
introduced in [6].

Remark 2. The proof of the theorem is quantitative: the
Kullback-Leibler divergence $D_{\mathrm{KL}}(P_{\theta^t} \,\|\, P_{\theta^{t+\delta t}})$ indicates how
much progress was made. More precisely (assuming for simplicity
a continuous situation with no plateaus), while the
probability under $P_{\theta^t}$ to fall into the best $q$ percent of points
for $P_{\theta^t}$ is $q$ by definition, the probability under $P_{\theta^{t+\delta t}}$ to
fall into the best $q$ percent of points for $P_{\theta^t}$ is at least
$q \exp\!\big(\frac{1-\delta t}{\delta t}\, D_{\mathrm{KL}}(P_{\theta^t} \,\|\, P_{\theta^{t+\delta t}})\big)$.
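
The bound of Remark 2 can be checked numerically on a tiny search space. The following is only a minimal sketch under stated assumptions, not part of the analysis: it enumerates $\{0,1\}^3$, uses Bernoulli distributions in expectation parameters, takes the infinite-population step in the form $\theta + \delta t\, E_{P_\theta}[W(x)(x - \theta)]$, and assumes that ties are handled by averaging $w$ over $[q^<, q^\le]$. The names `bernoulli_pmf`, `truncation_weights`, `igo_step` and the objective `f` are hypothetical.

```python
import itertools
import math


def bernoulli_pmf(theta, x):
    """Probability of the bit vector x under independent Bernoulli(theta_i)."""
    p = 1.0
    for th, xi in zip(theta, x):
        p *= th if xi == 1 else 1.0 - th
    return p


def truncation_weights(theta, f, points, q):
    """Return (probs, W): P_theta(x) and the weight W(x) for w(u) = 1[u <= q]/q.

    Assumption: W(x) is the average of w over [q_lt, q_le], which handles ties.
    """
    probs = {x: bernoulli_pmf(theta, x) for x in points}
    W = {}
    for x in points:
        q_lt = sum(p for y, p in probs.items() if f(y) < f(x))
        q_le = sum(p for y, p in probs.items() if f(y) <= f(x))
        W[x] = (min(q_le, q) - min(q_lt, q)) / (q * (q_le - q_lt))
    return probs, W


def igo_step(theta, f, points, q, dt):
    """One infinite-population step theta + dt * E[W(x) (x - theta)].

    Returns the new parameter, J(new | old), and the lower bound of Remark 2
    divided by q, i.e. exp((1 - dt)/dt * KL(old || new)).
    """
    probs, W = truncation_weights(theta, f, points, q)
    new_theta = [
        th + dt * sum(probs[x] * W[x] * (x[i] - th) for x in points)
        for i, th in enumerate(theta)
    ]
    J = sum(bernoulli_pmf(new_theta, x) * W[x] for x in points)
    kl = sum(p * math.log(p / bernoulli_pmf(new_theta, x)) for x, p in probs.items())
    bound = math.exp((1.0 - dt) / dt * kl)
    return new_theta, J, bound


points = list(itertools.product((0, 1), repeat=3))
f = lambda x: -sum(x)  # hypothetical objective to minimize (negated OneMax)
new_theta, J, bound = igo_step([0.5, 0.5, 0.5], f, points, q=0.25, dt=0.5)
print(new_theta, J, bound)
```

In this run every coordinate moves from $1/2$ to $2/3$ and $J = 48/27$, which exceeds the exponential bound. A step size strictly below 1 is used so that the new parameter stays in the open interval $(0, 1)$ and the logarithms are finite.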

3.2 Blockwise IGO-ML 

The expectation parameter is not always the most obvious
one. When it comes to multivariate Gaussian distributions,
the expectation parameter is the mean vector and second
moment, $(m, mm^{\mathrm{T}} + C)$. Meanwhile, the CMA-ES and the
CE/ML method for continuous optimization parametrize
the mean vector and covariance matrix, hence they differ
from the IGO-ML algorithm. Moreover, sometimes different
step sizes (learning rates) are employed for each parameter,
which makes the direction of parameter update different
from that of the natural gradient. Here, we justify some
of these settings by guaranteeing $q$-quantile improvement in
an extended framework.

We define an extension of IGO-ML, blockwise IGO-ML, that
recovers the pure rank-$\mu$ CMA-ES update with different
learning rates for $m$ and $C$.



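Definition 1. Let $\theta = (\theta_1, \ldots, \theta_k)$ be any decomposition
of the parameter $\theta$ into $k$ blocks, and let $\{\delta t_1, \ldots, \delta t_k\}$ be
a step size for each block. For $1 \le j \le k$, define the $j$-th
block partial IGO-ML update with step size $\delta t_j$ as the map
sending a parameter value $\theta$ to $\Phi_j(\theta)$ where

$$\Phi_j(\theta) := \mathop{\arg\max}_{\theta^* :\ \theta^*_i = \theta_i \text{ for all } i \neq j}
\Big\{ (1 - \delta t_j)\, E_{P_\theta}[\ln p_{\theta^*}(x)]
+ \delta t_j \sum_{i=1}^{\lambda} \hat{w}_i \ln p_{\theta^*}(x_i) \Big\}. \tag{22}$$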



The blockwise IGO-ML updates the parameter $\theta$ as follows.
Given a current parameter value $\theta^t$, update the first block
of $\theta^t$, then the second block, etc., in that order; explicitly,
set

$$\theta^{t+1} := (\Phi_k \circ \cdots \circ \Phi_2 \circ \Phi_1)(\theta^t), \tag{23}$$

where we note that the same Monte Carlo sample $\{x_i\}$ from
$P_{\theta^t}$ is used throughout the whole range of block updates
$\Phi_1, \ldots, \Phi_k$.



The infinite-population step ($\lambda = \infty$) reads the same with

$$\Phi_j(\theta) := \mathop{\arg\max}_{\theta^* :\ \theta^*_i = \theta_i \text{ for all } i \neq j}
\Big\{ (1 - \delta t_j)\, E_{P_\theta}[\ln p_{\theta^*}(x)]
+ \delta t_j\, E_{P_{\theta^t}}[W^f_{\theta^t}(x) \ln p_{\theta^*}(x)] \Big\}. \tag{24}$$

As before, the finite- and infinite-population blockwise IGO-ML
updates only make sense if the arg max in (22) or (24)
is uniquely determined.

Note that the blockwise IGO-ML depends on the decomposition
of the parameters into blocks and their update order,
while it is independent of the parametrization inside each
block. Blockwise IGO-ML is not necessarily equivalent to
IGO-ML even when all $\delta t_i$ are equal to $\delta t$.



Proposition 9. The pure rank-$\mu$ CMA-ES update (Example 2)
is an instance of blockwise IGO-ML for Gaussian
distributions, with parameter decomposition $\theta = (\theta_1, \theta_2)$
where $\theta_1 = C$, the covariance matrix, and $\theta_2 = m$, the mean
vector.

Proof. Given $\theta^t = (C^t, m^t)$, blockwise IGO-ML first updates
$C$ as follows:

$$C^{t+1} = \mathop{\arg\max}_{C} \Big\{ (1 - \delta t_C)\, E_{P_{(C^t, m^t)}}[\ln p_{(C, m^t)}(x)]
+ \delta t_C \sum_{i=1}^{\lambda} \hat{w}_i \ln p_{(C, m^t)}(x_i) \Big\}. \tag{25}$$

Considering $\{P_{(C, m^t)}\}$ as an exponential family of Gaussian
distributions whose mean vector is fixed to $m^t$, (25) can
be viewed as an ordinary IGO-ML step for this restricted
model. Then, since (after shifting the origin of the coordinate
system to $m^t$) $C$ is the expectation parameter of the
restricted model, the update is given by (17), namely

$$C^{t+1} = C^t + \delta t_C \sum_{i=1}^{\lambda} \hat{w}_i \big((x_i - m^t)(x_i - m^t)^{\mathrm{T}} - C^t\big).$$

Next, $m$ is updated as

$$m^{t+1} = \mathop{\arg\max}_{m} \Big\{ (1 - \delta t_m)\, E_{P_{(C^{t+1}, m^t)}}[\ln p_{(C^{t+1}, m)}(x)]
+ \delta t_m \sum_{i=1}^{\lambda} \hat{w}_i \ln p_{(C^{t+1}, m)}(x_i) \Big\}.$$

To derive $m^{t+1}$, let us differentiate the inside of arg max w.r.t.
$m$ and derive the zero point of the derivative. Seeing that
$\nabla_m \ln p_{(C^{t+1}, m)}(x) = (C^{t+1})^{-1}(x - m)$, we find the condition

$$(1 - \delta t_m)(C^{t+1})^{-1}(m^t - m^{t+1})
+ \delta t_m \sum_{i=1}^{\lambda} \hat{w}_i\, (C^{t+1})^{-1}(x_i - m^{t+1}) = 0,$$

which holds if and only if

$$m^{t+1} = m^t + \delta t_m \sum_{i=1}^{\lambda} \hat{w}_i (x_i - m^t).$$

This is equivalent to the pure rank-$\mu$ CMA-ES update. □

Quantile improvement as in Theorem 6 readily extends to
this setting as follows.

Theorem 10. Let the selection scheme be $w(u) = \mathbb{1}_{u \le q}/q$
where $0 < q < 1$. Assume that the arg max defining each
partial infinite-population IGO-ML update (24) is uniquely
determined. Then for $0 < \delta t_j \le 1$ ($j \in [1; k]$), each
infinite-population blockwise IGO-ML step (23) leads to $q$-quantile
improvement: $Q^q_{P_{\theta^{t+1}}}(f) \le Q^q_{P_{\theta^t}}(f)$.

Moreover, equality can hold only if $P_{\theta^{t+1}} = P_{\theta^t}$ or if
$P_{\theta^{t+1}}[x : f(x) = Q^q_{P_{\theta^t}}(f)] > 0$.



Consequently, each infinite-population step of the pure rank-$\mu$
CMA-ES update guarantees $q$-quantile improvement. Indeed,
from Proposition 9 this variant of the CMA-ES is an
instance of blockwise IGO-ML. Moreover, if each level set of
$f$ has zero Lebesgue measure, which often holds for continuous
optimization, we have strict $q$-quantile improvement.
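
The finite-population form of this update, with separate learning rates for $C$ and $m$, can be sketched in a few lines. This is an illustrative sketch only: the helper names `truncation_rank_weights` and `blockwise_rank_mu_update` are hypothetical, the rank-based weights $\hat{w}_i = 1/(\lambda q)$ for $(\mathrm{rk}(x_i) + 1/2)/\lambda \le q$ are an assumption about how truncation weights are assigned to a sample, and the sample points are fixed by hand rather than drawn from the Gaussian.

```python
def truncation_rank_weights(values, q):
    """Weights 1/(lam*q) for samples whose rank fraction (rank + 1/2)/lam <= q.

    Assumptions: ties are broken by sample order, and lam*q is an integer so
    that the weights sum to 1.
    """
    lam = len(values)
    order = sorted(range(lam), key=lambda i: values[i])
    weights = [0.0] * lam
    for rank, i in enumerate(order):
        if (rank + 0.5) / lam <= q:
            weights[i] = 1.0 / (lam * q)
    return weights


def blockwise_rank_mu_update(m, C, xs, weights, dt_C, dt_m):
    """Update C first (around the old mean m), then m, each with its own step size."""
    d = len(m)
    new_C = [
        [
            C[a][b]
            + dt_C * sum(w * ((x[a] - m[a]) * (x[b] - m[b]) - C[a][b])
                         for w, x in zip(weights, xs))
            for b in range(d)
        ]
        for a in range(d)
    ]
    new_m = [
        m[a] + dt_m * sum(w * (x[a] - m[a]) for w, x in zip(weights, xs))
        for a in range(d)
    ]
    return new_m, new_C


# Hypothetical sample; in practice xs would be drawn from N(m, C).
xs = [(1.0, 1.0), (2.0, 0.0), (-1.0, 3.0), (0.0, -3.0)]
f = lambda x: x[0] ** 2 + x[1] ** 2  # hypothetical objective to minimize
weights = truncation_rank_weights([f(x) for x in xs], q=0.5)
new_m, new_C = blockwise_rank_mu_update(
    [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], xs, weights, dt_C=0.5, dt_m=1.0
)
print(weights, new_m, new_C)
```

Both blocks are computed from the same sample and the old mean, so the order "first $C$, then $m$" shows up only in which mean the covariance update is centered on.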



Proof of Theorem 10. If $P_{\theta^{t+1}} = P_{\theta^t}$, obviously
$Q^q_{P_{\theta^{t+1}}}(f) = Q^q_{P_{\theta^t}}(f)$.
We assume $P_{\theta^{t+1}} \neq P_{\theta^t}$ in the following.

Set $\theta^{t,0} := \theta^t$ and $\theta^{t,j} := \Phi_j(\theta^{t,j-1})$ so that $\theta^{t+1} = \theta^{t,k}$.
According to Lemma 8, to prove quantile improvement it is
enough to show that $J(\theta^{t+1} \mid \theta^t) > 1$. Moreover, this implies
strict quantile improvement provided
$P_{\theta^{t+1}}[x : f(x) = Q^q_{P_{\theta^t}}(f)] = 0$.

According to (20), if $H_t(\theta^{t+1}) > H_t(\theta^t)$ we have $J(\theta^{t+1} \mid \theta^t) > 1$.
To show that $H_t(\theta^{t+1}) > H_t(\theta^t)$ we decompose
$H_t(\theta^{t+1}) - H_t(\theta^t)$ into the sum of partial differences, namely,

$$H_t(\theta^{t+1}) - H_t(\theta^t) = \sum_{j=1}^{k} \big( H_t(\theta^{t,j}) - H_t(\theta^{t,j-1}) \big),$$

and we will prove that each term is non-negative. Moreover,
if $P_{\theta^{t,j}} \neq P_{\theta^{t,j-1}}$ for some $j \in [1; k]$, we will have
$H_t(\theta^{t,j}) - H_t(\theta^{t,j-1}) > 0$ for this $j$. Since $P_{\theta^{t+1}} \neq P_{\theta^t}$ implies
$P_{\theta^{t,j}} \neq P_{\theta^{t,j-1}}$ for at least one $j \in [1; k]$, we will have
that $H_t(\theta^{t+1}) - H_t(\theta^t) > 0$, resulting in $J(\theta^{t+1} \mid \theta^t) > 1$.

We proceed as in Theorem 6. Since $\theta^{t,j} = \Phi_j(\theta^{t,j-1})$ is the
only maximizer of (24), we have

$$(1 - \delta t_j)\, E_{P_{\theta^{t,j-1}}}[\ln p_{\theta^{t,j}}(x)] + \delta t_j\, H_t(\theta^{t,j})
\ge (1 - \delta t_j)\, E_{P_{\theta^{t,j-1}}}[\ln p_{\theta^{t,j-1}}(x)] + \delta t_j\, H_t(\theta^{t,j-1})$$

with equality holding if and only if $\theta^{t,j} = \theta^{t,j-1}$. Rearranging,
we get

$$H_t(\theta^{t,j}) - H_t(\theta^{t,j-1}) > \frac{1 - \delta t_j}{\delta t_j}\, D_{\mathrm{KL}}\big(P_{\theta^{t,j-1}} \,\|\, P_{\theta^{t,j}}\big)$$

if $\theta^{t,j} \neq \theta^{t,j-1}$, and $H_t(\theta^{t,j}) = H_t(\theta^{t,j-1})$ if $\theta^{t,j} = \theta^{t,j-1}$.
The right-hand side of the above inequality is non-negative
for $0 < \delta t_j \le 1$. Therefore, $H_t(\theta^{t,j}) - H_t(\theta^{t,j-1}) \ge 0$ for all
$j \in [1; k]$. Moreover, since $P_{\theta^{t+1}} \neq P_{\theta^t}$, for at least one
$j \in [1; k]$ we have $\theta^{t,j} \neq \theta^{t,j-1}$ and thus
$H_t(\theta^{t,j}) - H_t(\theta^{t,j-1}) > 0$ for this $j$, implying that
$H_t(\theta^{t+1}) - H_t(\theta^t) > 0$. This completes the proof. □

3.3 Finite Population Sizes 

The results above are valid for "ideal" updates with infinite
sample size. With finite sample size, the update (9) defines a
stochastic sequence (depending on the random sample $\{x_i\}$)
and so one cannot expect monotone $q$-quantile improvement
at each step. Still, we can expect $q$-quantile improvement
with high probability when the population size is sufficiently
large.

We provide an analogue of Theorem 6 for finite but large
population size. A similar statement holds for blockwise
IGO-ML. The proof follows a standard probabilistic
approximation argument.

Proposition 11. Let $w(\cdot)$ be the $q$-truncation selection
scheme: $w(u) = \mathbb{1}_{u \le q}/q$ where $0 < q < 1$. Let $\{P_\theta\}$ be an
exponential family of probability distributions, parametrized
by its expectation parameter. Assume that the arg max defining
the infinite-population IGO-ML step (10) is uniquely defined.

Assume that for all $\theta \in \Theta$, the derivative $\partial \ln P_\theta(x)/\partial\theta$
exists for $P_\theta$-almost all $x \in X$ and has finite second moment:
$E_{P_\theta}\big[\,|\partial \ln P_\theta(x)/\partial\theta|^2\,\big] < \infty$.

Let $0 < \delta t \le 1$. Let $\theta^{t+\delta t}_{\lambda}$ be the IGO-ML update (9) with
sample size $\lambda$, and let $\theta^{t+\delta t}_{\infty}$ be the infinite-population
IGO-ML update (10). Assume that $\theta^{t+\delta t}_{\infty} \neq \theta^t$.

Then, with probability tending to 1 as $\lambda \to \infty$, the
finite-population update $\theta^{t+\delta t}_{\lambda}$ results in $q$-quantile improvement:

$$Q^q_{P_{\theta^{t+\delta t}_{\lambda}}}(f) \le Q^q_{P_{\theta^t}}(f).$$

Consequently, the same holds for the CE/ML method and
the IGO algorithm when they are applied to an exponential
family using the expectation parameters.

Note the assumption that the ideal dynamics has not reached
equilibrium yet: $\theta^{t+\delta t}_{\infty} \neq \theta^t$. If $\theta^{t+\delta t}_{\infty} = \theta^t$, the
finite-population dynamics will just randomly wander around this
equilibrium value with some noise, resulting in either
improvement or deterioration at each step.

Also note that the population size $\lambda$ needed may depend on
the current location $\theta^t$ in parameter space, as well as the
objective function $f$. For instance, highly oscillating functions
$f$ likely require higher population sizes for a consistent
estimation of the IGO-ML update.
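
The dependence on the population size can be explored with a small simulation. The sketch below is an illustration under assumptions rather than a statement about any particular implementation: it uses Bernoulli distributions on $\{0,1\}^3$, the finite-population step in the form $\theta + \delta t \sum_i \hat{w}_i (x_i - \theta)$ with rank-based truncation weights (ties broken by sample order), and it measures how often $J(\theta^{t+\delta t}_{\lambda} \mid \theta^t) > 1$, the condition under which Lemma 8 gives quantile improvement. All function names and the objective `f` are hypothetical.

```python
import itertools
import random


def bernoulli_pmf(theta, x):
    """Probability of the bit vector x under independent Bernoulli(theta_i)."""
    p = 1.0
    for th, xi in zip(theta, x):
        p *= th if xi == 1 else 1.0 - th
    return p


def exact_weights(theta, f, points, q):
    """W(x) under P_theta for w(u) = 1[u <= q]/q (ties averaged; an assumption)."""
    probs = {x: bernoulli_pmf(theta, x) for x in points}
    W = {}
    for x in points:
        q_lt = sum(p for y, p in probs.items() if f(y) < f(x))
        q_le = sum(p for y, p in probs.items() if f(y) <= f(x))
        W[x] = (min(q_le, q) - min(q_lt, q)) / (q * (q_le - q_lt))
    return W


def finite_step(theta, f, q, dt, lam, rng):
    """Finite-population step theta + dt * sum_i w_i (x_i - theta) from lam samples."""
    xs = [tuple(1 if rng.random() < th else 0 for th in theta) for _ in range(lam)]
    order = sorted(range(lam), key=lambda i: f(xs[i]))
    w_hat = [0.0] * lam
    for rank, i in enumerate(order):
        if (rank + 0.5) / lam <= q:  # simplification: ties broken by sample order
            w_hat[i] = 1.0 / (lam * q)
    return [
        th + dt * sum(w * (x[k] - th) for w, x in zip(w_hat, xs))
        for k, th in enumerate(theta)
    ]


def improvement_frequency(theta, f, q, dt, lam, trials, seed=0):
    """Fraction of trials in which J(theta_lam | theta) > 1."""
    rng = random.Random(seed)
    points = list(itertools.product((0, 1), repeat=len(theta)))
    W = exact_weights(theta, f, points, q)
    hits = 0
    for _ in range(trials):
        new_theta = finite_step(theta, f, q, dt, lam, rng)
        J = sum(bernoulli_pmf(new_theta, x) * W[x] for x in points)
        hits += J > 1.0
    return hits / trials


f = lambda x: -sum(x)  # hypothetical objective to minimize (negated OneMax)
theta = [0.5, 0.5, 0.5]
freq_small = improvement_frequency(theta, f, q=0.25, dt=0.5, lam=4, trials=50)
freq_large = improvement_frequency(theta, f, q=0.25, dt=0.5, lam=400, trials=50)
print(freq_small, freq_large)
```

On this smooth toy objective a few hundred samples already make $J > 1$ nearly certain; the required $\lambda$ would be expected to grow for less regular objectives, in line with the remark above.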

Proof. For exponential families, the IGO and IGO-ML
updates coincide. Under the conditions of the proposition, the
finite-population IGO update (8) is a consistent estimator of
the infinite-population IGO update (5) [5, Proposition 18],
implying that $\theta^{t+\delta t}_{\lambda}$ converges with probability one to $\theta^{t+\delta t}_{\infty}$.
Under our regularity assumptions on $P_\theta$, this implies pointwise
convergence of $p_{\theta^{t+\delta t}_{\lambda}}$ to $p_{\theta^{t+\delta t}_{\infty}}$, which, since $W^f_{\theta^t}(x)$ is
bounded, leads to

$$J(\theta^{t+\delta t}_{\lambda} \mid \theta^t) = E_{P_{\theta^{t+\delta t}_{\lambda}}}[W^f_{\theta^t}(x)]
\;\to\; E_{P_{\theta^{t+\delta t}_{\infty}}}[W^f_{\theta^t}(x)] = J(\theta^{t+\delta t}_{\infty} \mid \theta^t)
\quad \text{as } \lambda \to \infty.$$

Now the right-hand side is greater than 1 for $0 < \delta t \le 1$
unless $\theta^{t+\delta t}_{\infty} = \theta^t$, as we have shown in the proof of
Theorem 6. Thus, we have $J(\theta^{t+\delta t}_{\lambda} \mid \theta^t) > 1$ with high probability
for sufficiently large $\lambda$. Thus Lemma 8 entails $q$-quantile
improvement with high probability. □

4. FITNESS-PROPORTIONAL SELECTION 

These results carry over to the use of a composite $g \circ f$ of
a function $g$ with the objective function $f$, as a selection
weight instead of $W^f_{\theta^t}$ in the IGO framework. This covers,
for instance, fitness-proportional selection ($g = \mathrm{Id}$). We
prove that, when considering the natural gradient ascent for
an exponential family (12) using the expectation parameter
(13), we can guarantee monotone $E_{P_\theta}[g \circ f(x)]$-value
improvement for updates of step size inversely proportional to
$E_{P_\theta}[g \circ f(x)]$. More precisely,



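Theorem 12. Assume $g \circ f$ is non-negative and not
everywhere 0. Consider the update

$$\theta^{t+\delta t} = \theta^t + \delta t\, E_{P_{\theta^t}}\!\left[ \frac{g \circ f(x)}{E_{P_{\theta^t}}[g \circ f(x)]}\, \big(T(x) - \theta^t\big) \right] \tag{26}$$

where $\theta = \eta$ is the expectation parameter of the exponential
family $\{P_\theta\}$.

Then for $0 < \delta t \le 1$, we have

$$E_{P_{\theta^{t+\delta t}}}[g \circ f(x)] \ge E_{P_{\theta^t}}[g \circ f(x)].$$

Moreover, equality can occur only if $P_{\theta^{t+\delta t}} = P_{\theta^t}$.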



Gradient based methods with fitness-proportional selection
are often employed, especially in reinforcement learning, e.g.
policy gradient with parameter based exploration (PGPE)
[17]. One disadvantage of gradient based methods is that the
step size has to be calibrated by the user depending on the
problem at hand. Alternative methods such as
expectation-maximization [9], including the RPP below, are sometimes
employed to avoid this issue. Theorem 12, however, ensures
that each natural gradient step improves the expected fitness
for $0 < \delta t \le 1$ when an exponential family is used with
its expectation parameters.



Example 4. The Relative Payoff Procedure (RPP) [12] is
a reinforcement learning algorithm, also known as
expectation-maximization (EM) algorithm for reinforcement learning [9].
The RPP expresses a policy on the action space $X = \{0, 1\}^d$
by a Bernoulli distribution $P_\theta(x)$ parametrized by the
expectation parameter. The objective to be maximized is the
expectation $E_{P_\theta}[r(x)]$ of non-negative reward $r(x)$ after taking
action $x \in X$. The RPP updates the parameters to

$$\theta^{t+1} = \frac{E_{P_{\theta^t}}[x\, r(x)]}{E_{P_{\theta^t}}[r(x)]}.$$

Remember the sufficient statistics $T(x)$ for Bernoulli distributions
are $T_i(x) = x_i$. Thus the RPP is equivalent to (26)
with $g \circ f(x) = r(x)$ and $\delta t = 1$ and can be viewed as a
natural gradient ascent with large step.

The RPP is known from [5] to monotonically improve expected
reward, thanks to its expectation-maximization
interpretation. Theorem 12 can be thought of as an extension
of this result, and also shows monotone improvement for the
smoothed RPP, where a step size $0 < \delta t < 1$ is introduced.
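
As a concrete illustration of the RPP and its smoothed variant, the following minimal sketch applies the update (26) with $g \circ f = r$ to a two-bit Bernoulli policy and evaluates the expected reward exactly by enumeration. The reward function `r` and the helper names are hypothetical and chosen only for this example.

```python
import itertools


def bernoulli_pmf(theta, x):
    """Probability of the bit vector x under independent Bernoulli(theta_i)."""
    p = 1.0
    for th, xi in zip(theta, x):
        p *= th if xi == 1 else 1.0 - th
    return p


def expected_reward(theta, r, points):
    """Exact E[r(x)] under P_theta by enumeration of the action space."""
    return sum(bernoulli_pmf(theta, x) * r(x) for x in points)


def smoothed_rpp_step(theta, r, points, dt):
    """theta + dt * E[r(x) (x - theta)] / E[r(x)], expectations under P_theta."""
    mean_r = expected_reward(theta, r, points)
    return [
        th + dt * sum(bernoulli_pmf(theta, x) * r(x) * (x[i] - th) for x in points) / mean_r
        for i, th in enumerate(theta)
    ]


points = list(itertools.product((0, 1), repeat=2))
r = lambda x: 1.0 + x[0] + 2.0 * x[1]  # hypothetical non-negative reward
theta = [0.5, 0.5]
rpp = smoothed_rpp_step(theta, r, points, dt=1.0)       # plain RPP (dt = 1)
smoothed = smoothed_rpp_step(theta, r, points, dt=0.5)  # smoothed RPP
print(expected_reward(theta, r, points),
      expected_reward(smoothed, r, points),
      expected_reward(rpp, r, points))
```

Starting from an expected reward of 2.5, the smoothed step reaches 2.75 and the full RPP step reaches 3.0, so both step sizes improve the expected reward in this instance.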

Proof. Most of the proof of Theorem 6 carries over.
Replacing $W^f_{\theta^t}$ with $g \circ f / E_{P_{\theta^t}}[g \circ f(x)]$ in the
infinite-population update, the identification of the natural gradient
step with the IGO-ML maximization still holds and we have

$$\theta^{t+\delta t} = \theta^t + \delta t\, E_{P_{\theta^t}}\!\left[ \frac{g \circ f(x)}{E_{P_{\theta^t}}[g \circ f(x)]}\, \big(T(x) - \theta^t\big) \right]
= \mathop{\arg\max}_{\theta} \left\{ (1 - \delta t)\, E_{P_{\theta^t}}[\ln p_\theta(x)]
+ \delta t\, E_{P_{\theta^t}}\!\left[ \frac{g \circ f(x)}{E_{P_{\theta^t}}[g \circ f(x)]}\, \ln p_\theta(x) \right] \right\}. \tag{27}$$

Thanks to Jensen's inequality, we have the counterpart of
(20) as

$$\ln E_{P_\theta}[g \circ f(x)] - \ln E_{P_{\theta^t}}[g \circ f(x)]
\ge \frac{E_{P_{\theta^t}}[g \circ f(x) \ln p_\theta(x)]}{E_{P_{\theta^t}}[g \circ f(x)]}
- \frac{E_{P_{\theta^t}}[g \circ f(x) \ln p_{\theta^t}(x)]}{E_{P_{\theta^t}}[g \circ f(x)]}. \tag{28}$$

Because of the second equality of (27), we have the counterpart
of (21) as

$$\frac{E_{P_{\theta^t}}[g \circ f(x) \ln p_{\theta^{t+\delta t}}(x)]}{E_{P_{\theta^t}}[g \circ f(x)]}
- \frac{E_{P_{\theta^t}}[g \circ f(x) \ln p_{\theta^t}(x)]}{E_{P_{\theta^t}}[g \circ f(x)]}
\ge \frac{1 - \delta t}{\delta t}\, D_{\mathrm{KL}}\big(P_{\theta^t} \,\|\, P_{\theta^{t+\delta t}}\big),$$

and moreover, since the maximizer in (27) is unique, the
inequality is strict unless $\theta^t = \theta^{t+\delta t}$. Hence, since the
right-hand side is non-negative, by (28) we have
$\ln E_{P_{\theta^{t+\delta t}}}[g \circ f(x)] \ge \ln E_{P_{\theta^t}}[g \circ f(x)]$
with equality only if $P_{\theta^t} = P_{\theta^{t+\delta t}}$. This
completes the proof. □

Remark 3. As mentioned in Remark 1, Malagò et al. [16]
propose the natural gradient algorithm for discrete optimization
using exponential distributions. However, as they
parametrize the exponential distributions by the natural
parameters $\theta = \beta$, Theorem 12 does not guarantee expected
fitness improvement for their algorithm, whereas it does so
for the algorithm (19) using the expectation parameters.

5. FURTHER DISCUSSION

These results contribute to bringing theory closer to practice,
by waiving the need for infinitesimal step sizes in gradient
ascent. Still, they cover only the "ideal" situation with
infinite population size, as well as finite but very large
population sizes (by a standard probabilistic approximation
argument). Finite population sizes lead to stochastic behavior
and so monotone objective improvement at each step occurs
only with high probability.

In practice, population sizes used can be quite small, $\lambda \le 10$,
with medium to small step sizes [6, 11]. It has been shown
in [1, Remark 2] that when population size does not tend to
infinity, the expectation of the natural gradient estimate (8)
is the natural gradient (5) with a different selection scheme
$w$. So using the truncation weight $w(u) = \mathbb{1}_{u \le q}$ with a
small population size and very small step sizes will result, by
the machinery of stochastic approximation [8, 14], in simulating
an infinite-population IGO step with another selection
scheme, a situation outside the scope of this article. Our results,
on the contrary, suggest using larger populations and
larger step sizes instead.

Finally, let us stress that objective improvement is not, by
itself, a sufficient guarantee that optimization performs well:
in situations of premature convergence, the objective still
improves at each step. Premature convergence can occur
for large values of the learning rate in some instantiations
of IGO and IGO-ML (see the study in [5]); our results say
nothing about this phenomenon.
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